which is defined as
\[
H(Q_a(x)) = -\sum_{q_x} p(q_x)\log p(q_x) = \frac{1}{2}\log 2\pi e\sigma_x^2,
\qquad
\max H(Q_a(x)) = \frac{n\ln 2}{2^n}, \ \text{when } p(q_x) = \frac{1}{2^n},
\tag{2.19}
\]
where $q_x$ denotes the random quantized variables in $Q_a(x)$ (which stands for $Q_a(q)$ or $Q_a(k)$, depending on the input) with probability mass function $p(\cdot)$. The information entropy should be maximized during quantization so that the quantized MHSA modules retain the information contained in their full-precision counterparts.
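To make the quantity in Eq. (2.19) concrete, the sketch below estimates the information entropy of an already-quantized tensor from the empirical frequencies of its discrete levels; the helper name is hypothetical and not part of the original method, and a more uniform use of the quantization levels simply yields a higher entropy.

```python
import torch

def quantized_entropy(x_q: torch.Tensor) -> float:
    """Estimate H(Q_a(x)) = -sum_{q_x} p(q_x) log p(q_x) from the empirical
    frequency of each discrete level in an already-quantized tensor x_q."""
    _, counts = torch.unique(x_q, return_counts=True)
    p = counts.float() / counts.sum()        # empirical probability mass p(q_x)
    return float(-(p * p.log()).sum())       # entropy in nats
```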
However, directly applying a quantization function that maps values onto a finite set of fixed points introduces an irreversible disturbance to the distributions, and the information entropies $H(Q_a(q))$ and $H(Q_a(k))$ degenerate to a much lower level than those of their full-precision counterparts. To mitigate this information degradation in the quantized attention mechanism, an Information Rectification Module (IRM) is proposed to effectively maximize the information entropy of the quantized attention weights:
\[
Q_a(\tilde{q}) = Q_a\!\left(\gamma_q\,\frac{q-\mu(q)}{\sqrt{\sigma^2(q)+\epsilon_q}} + \beta_q\right),
\qquad
Q_a(\tilde{k}) = Q_a\!\left(\gamma_k\,\frac{k-\mu(k)}{\sqrt{\sigma^2(k)+\epsilon_k}} + \beta_k\right),
\tag{2.20}
\]
where $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ are learnable parameters that reshape the distributions of $\tilde{q}$ and $\tilde{k}$, while $\epsilon_q$ and $\epsilon_k$ are small constants that keep the denominators from being zero. The learnable $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ are trained with the same learning rate as the rest of the network. Thus, after IRM, the information entropies $H(Q_a(\tilde{q}))$ and $H(Q_a(\tilde{k}))$ are formulated as
\[
H(Q_a(\tilde{q})) = \frac{1}{2}\log 2\pi e\!\left[\gamma_q^2(\sigma_q^2+\epsilon_q)\right],
\qquad
H(Q_a(\tilde{k})) = \frac{1}{2}\log 2\pi e\!\left[\gamma_k^2(\sigma_k^2+\epsilon_k)\right].
\tag{2.21}
\]
Then, the learnable parameters $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ reshape the distributions of the query and key values toward the state of maximum information entropy, reviving the ability of the attention mechanism to capture critical elements. In a nutshell, in our IRM-Attention structure, the information entropy of the quantized attention weights is maximized to alleviate their severe information distortion and revive the attention mechanism.
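The following PyTorch sketch illustrates one way the rectification of Eq. (2.20) could be realized in code. It assumes that $\mu$ and $\sigma^2$ are computed per token over the channel dimension and that `quantizer` stands for the activation quantizer $Q_a$; the class name, argument names, and normalization axis are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class InformationRectification(nn.Module):
    """Sketch of the IRM in Eq. (2.20): a learnable re-normalization of the
    query/key distributions applied before activation quantization, so that
    gamma and beta can be trained to maximize the entropy in Eq. (2.21)."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # gamma_q or gamma_k
        self.beta = nn.Parameter(torch.zeros(dim))   # beta_q or beta_k
        self.eps = eps                               # epsilon_q or epsilon_k

    def forward(self, x: torch.Tensor, quantizer) -> torch.Tensor:
        # x: (batch, num_patches, dim) query or key activations
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_tilde = self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta
        return quantizer(x_tilde)                    # Q_a(q~) or Q_a(k~)
```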
2.3.4 Distribution-Guided Distillation Through Attention
To address the attention distribution mismatch that occurs in the fully quantized ViT baseline during backward propagation, we further propose a distribution-guided distillation (DGD) scheme with appropriate distilled activations and well-designed similarity matrices, which exploits the teacher's knowledge effectively and optimizes the fully quantized ViT more accurately.
As an optimization technique based on element-wise comparison of activations, distillation allows the quantized ViT to mimic the output logits of the full-precision teacher model. However, we find that the distillation procedure used in previous ViTs and in the fully quantized ViT baseline (Section 2.3.1) cannot provide fine-grained supervision of the attention weights (shown in Fig. 2.6), leading to insufficient optimization. To address this insufficient optimization in distilling the fully quantized ViT, we propose the Distribution-Guided Distillation (DGD) method in Q-ViT. Following [226], we first build patch-based similarity pattern matrices from the upstream query and key, instead of from the attention weights, which is formulated as
\[
\begin{aligned}
\tilde{G}^{l}_{q_h} &= \tilde{q}^{l}_{h}\cdot(\tilde{q}^{l}_{h})^{\top}, &\quad G^{l}_{q_h} &= \tilde{G}^{l}_{q_h}\,/\,\|\tilde{G}^{l}_{q_h}\|_2,\\
\tilde{G}^{l}_{k_h} &= \tilde{k}^{l}_{h}\cdot(\tilde{k}^{l}_{h})^{\top}, &\quad G^{l}_{k_h} &= \tilde{G}^{l}_{k_h}\,/\,\|\tilde{G}^{l}_{k_h}\|_2,
\end{aligned}
\tag{2.22}
\]
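As a rough illustration of Eq. (2.22), the sketch below builds the normalized similarity matrix for a single head in a single layer; the function name is hypothetical, and interpreting $\|\cdot\|_2$ as the Frobenius norm of the matrix is an assumption.

```python
import torch

def similarity_pattern(x: torch.Tensor) -> torch.Tensor:
    """Patch-based similarity matrix for one head: x has shape
    (num_patches, head_dim); returns G = (x x^T) / ||x x^T||_2."""
    g = x @ x.transpose(-2, -1)    # G~ = x x^T, shape (num_patches, num_patches)
    return g / g.norm(p=2)         # normalize (here: Frobenius norm of G~)
```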